The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a historic moment both scientifically and analytically (we have a large amount of data being shared across the globe and analyzed in real time). The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always 'ahead' of Our World in Data. I also noticed that JHU's website was buggy: clicking on the U.S. stats did not actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World in Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in totals like case counts (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Wed Jun 10 21:59:58 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the 'analysis' proper but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Wed Jun 10 21:59:58 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name      description
prediction_model   outputs case estimate for given log-linear model parameters slope and intercept
make_long          converts input data to long format (specialized cases)
name_overlaps      outputs the column-name intersection and set differences of two data frames
find_linear_index  finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b        # the model was fit on log10(cases), so this is the log10 prediction
  total_cases<-10^total_cases.log  # back-transform to absolute case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
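As a usage sketch, the fit-then-back-transform step that prediction_model encapsulates looks like this on toy data (the series and column names here are made up for illustration):

```r
# Toy series: cases doubling every 3 days after the 100th case
toy <- data.frame(Days_since_100 = 0:20)
toy$Total_confirmed_cases.log <- log10(100 * 2^(toy$Days_since_100 / 3))

# Fit log10(cases) ~ days; slope m and intercept b parameterize the growth
fit <- lm(Total_confirmed_cases.log ~ Days_since_100, data = toy)
b <- unname(coef(fit)[1])   # ~2, i.e. log10 of 100 cases at day 0
m <- unname(coef(fit)[2])   # ~log10(2)/3, i.e. one doubling per 3 days

# Predicted absolute cases one week past the observed window
predicted <- 10^(m * 21:27 + b)
```

These coefficients are exactly the m and b that prediction_model expects.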

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/value. Also enforces standardization/assumes data structure naming by using fixed variable name, value name, and id.vars
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
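make_long is a thin wrapper around reshape2::melt; the shape change it performs looks like this on a tiny mock frame (column names chosen for illustration):

```r
library(reshape2)

# Two date columns in JHU's wide format
wide <- data.frame(case_type = "total",
                   Country.Region = "X",
                   X3.01.20 = 10, X3.02.20 = 14)

# All non-id columns are stacked into (Date, Total_confirmed_cases) pairs
long <- melt(wide,
             id.vars = c("case_type", "Country.Region"),
             variable.name = "Date",
             value.name = "Total_confirmed_cases")
# long now has one row per (region, date) with the case count as a value
```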

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}
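The set operations name_overlaps wraps behave like this on two mock frames (names made up for illustration):

```r
df1 <- data.frame(a = 1, b = 2, c = 3)
df2 <- data.frame(b = 0, c = 0, d = 0)

shared   <- intersect(names(df1), names(df2))  # columns present in both
only_df1 <- setdiff(names(df1), names(df2))    # in df1 but not df2
only_df2 <- setdiff(names(df2), names(df1))    # in df2 but not df1
```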

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count second-derivative values within 0.2 of zero over a trailing window (width = running_avg).
  # Stability is reached when the whole window is within tolerance.
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-(running_avg-1)),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Back up running_avg days from the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
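The heuristic behind find_linear_index is that exponential growth is linear in log space, so the second derivative of log cases hovers near zero once the exponential phase begins. A self-contained sketch of that core computation on a synthetic series (values made up for illustration):

```r
# Synthetic series: flat start, then clean doubling every 2 days
cases <- c(rep(100, 5), 100 * 2^((1:15) / 2))
log_cases <- log(cases, 2)

first_deriv  <- diff(log_cases)   # day-over-day slope of log2 cases
second_deriv <- diff(first_deriv) # change in that slope

# During both the flat and the pure exponential phase, the second
# derivative is ~0; only the transition between them exceeds 0.2
stable <- abs(second_deriv) < 0.2
```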

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.9.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.9.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.9.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.9.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US deaths file but not in the other datasets. When combining, all datasets will have this column, but US deaths is the only one with real values.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
# JHU uses an awkward file format: wide, with each column a date in MM/DD/YY. Read this in as raw data but transform it into a shape better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # leading 0 read in as X
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%Y") # year was expanded to 4 digits above
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
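The cleanup above can be checked on a single raw JHU column name, e.g. "X1.22.20" (a representative header, not taken from the data):

```r
d <- "X1.22.20"                  # raw JHU column name: X + M.DD.YY
d <- gsub("^X", "0", d)          # "01.22.20"  - the leading X stands in for a dropped 0
d <- gsub("20$", "2020", d)      # "01.22.2020" - expand the 2-digit year
as.Date(d, format = "%m.%d.%Y")  # 2020-01-22
```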

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 140 140
Albania 140 140
Algeria 140 140
Andorra 140 140
Angola 140 140
Antigua and Barbuda 140 140
Argentina 140 140
Armenia 140 140
Australia 1120 1120
Austria 140 140
Azerbaijan 140 140
Bahamas 140 140
Bahrain 140 140
Bangladesh 140 140
Barbados 140 140
Belarus 140 140
Belgium 140 140
Belize 140 140
Benin 140 140
Bhutan 140 140
Bolivia 140 140
Bosnia and Herzegovina 140 140
Botswana 140 140
Brazil 140 140
Brunei 140 140
Bulgaria 140 140
Burkina Faso 140 140
Burma 140 140
Burundi 140 140
Cabo Verde 140 140
Cambodia 140 140
Cameroon 140 140
Canada 1960 1960
Central African Republic 140 140
Chad 140 140
Chile 140 140
China 4620 4620
Colombia 140 140
Comoros 140 140
Congo (Brazzaville) 140 140
Congo (Kinshasa) 140 140
Costa Rica 140 140
Cote d’Ivoire 140 140
Croatia 140 140
Cuba 140 140
Cyprus 140 140
Czechia 140 140
Denmark 420 420
Diamond Princess 140 140
Djibouti 140 140
Dominica 140 140
Dominican Republic 140 140
Ecuador 140 140
Egypt 140 140
El Salvador 140 140
Equatorial Guinea 140 140
Eritrea 140 140
Estonia 140 140
Eswatini 140 140
Ethiopia 140 140
Fiji 140 140
Finland 140 140
France 1540 1540
Gabon 140 140
Gambia 140 140
Georgia 140 140
Germany 140 140
Ghana 140 140
Greece 140 140
Grenada 140 140
Guatemala 140 140
Guinea 140 140
Guinea-Bissau 140 140
Guyana 140 140
Haiti 140 140
Holy See 140 140
Honduras 140 140
Hungary 140 140
Iceland 140 140
India 140 140
Indonesia 140 140
Iran 140 140
Iraq 140 140
Ireland 140 140
Israel 140 140
Italy 140 140
Jamaica 140 140
Japan 140 140
Jordan 140 140
Kazakhstan 140 140
Kenya 140 140
Korea, South 140 140
Kosovo 140 140
Kuwait 140 140
Kyrgyzstan 140 140
Laos 140 140
Latvia 140 140
Lebanon 140 140
Lesotho 140 140
Liberia 140 140
Libya 140 140
Liechtenstein 140 140
Lithuania 140 140
Luxembourg 140 140
Madagascar 140 140
Malawi 140 140
Malaysia 140 140
Maldives 140 140
Mali 140 140
Malta 140 140
Mauritania 140 140
Mauritius 140 140
Mexico 140 140
Moldova 140 140
Monaco 140 140
Mongolia 140 140
Montenegro 140 140
Morocco 140 140
Mozambique 140 140
MS Zaandam 140 140
Namibia 140 140
Nepal 140 140
Netherlands 700 700
New Zealand 140 140
Nicaragua 140 140
Niger 140 140
Nigeria 140 140
North Macedonia 140 140
Norway 140 140
Oman 140 140
Pakistan 140 140
Panama 140 140
Papua New Guinea 140 140
Paraguay 140 140
Peru 140 140
Philippines 140 140
Poland 140 140
Portugal 140 140
Qatar 140 140
Romania 140 140
Russia 140 140
Rwanda 140 140
Saint Kitts and Nevis 140 140
Saint Lucia 140 140
Saint Vincent and the Grenadines 140 140
San Marino 140 140
Sao Tome and Principe 140 140
Saudi Arabia 140 140
Senegal 140 140
Serbia 140 140
Seychelles 140 140
Sierra Leone 140 140
Singapore 140 140
Slovakia 140 140
Slovenia 140 140
Somalia 140 140
South Africa 140 140
South Sudan 140 140
Spain 140 140
Sri Lanka 140 140
Sudan 140 140
Suriname 140 140
Sweden 140 140
Switzerland 140 140
Syria 140 140
Taiwan* 140 140
Tajikistan 140 140
Tanzania 140 140
Thailand 140 140
Timor-Leste 140 140
Togo 140 140
Trinidad and Tobago 140 140
Tunisia 140 140
Turkey 140 140
Uganda 140 140
Ukraine 140 140
United Arab Emirates 140 140
United Kingdom 1540 1540
Uruguay 140 140
US 140 140
US_state 456540 456540
Uzbekistan 140 140
Venezuela 140 140
Vietnam 140 140
West Bank and Gaza 140 140
Western Sahara 140 140
Yemen 140 140
Zambia 140 140
Zimbabwe 140 140
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: Non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured that case100 works for the U.S., so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
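The ddply summarize step above can be sanity-checked on a toy frame (column names mirror the real ones; the numbers are made up):

```r
library(plyr)

toy <- data.frame(Country.Region = c("A", "A", "A", "B", "B"),
                  Date.numeric = c(1, 2, 3, 2, 3),
                  Total_confirmed_cases = c(50, 150, 400, 90, 120))

# Earliest date on which each region exceeds 100 cases
case100 <- ddply(subset(toy, Total_confirmed_cases > 100),
                 c("Country.Region"), summarise,
                 case100_date = min(Date.numeric))

toy <- merge(toy, case100)
toy$Days_since_100 <- toy$Date.numeric - toy$case100_date
```

Region A crosses 100 on day 2 and region B on day 3, so Days_since_100 is negative before the threshold and counts up afterwards.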

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 5213
Alaska 1018
Arizona 1286
Arkansas 5505
California 4902
Colorado 4727
Connecticut 763
Delaware 319
Diamond Princess 85
District of Columbia 86
Florida 5518
Georgia 12425
Grand Princess 86
Guam 86
Hawaii 442
Idaho 2514
Illinois 7248
Indiana 7205
Iowa 6780
Kansas 5694
Kentucky 8146
Louisiana 5190
Maine 1321
Maryland 2009
Massachusetts 1281
Michigan 6334
Minnesota 6046
Mississippi 6439
Missouri 7267
Montana 2248
Nebraska 4345
Nevada 1011
New Hampshire 906
New Jersey 1927
New Mexico 2228
New York 4903
North Carolina 7636
North Dakota 2697
Northern Mariana Islands 71
Ohio 6797
Oklahoma 5230
Oregon 2617
Pennsylvania 5338
Puerto Rico 86
Rhode Island 508
South Carolina 3743
South Dakota 3421
Tennessee 7368
Texas 15806
Utah 1175
Vermont 1215
Virgin Islands 86
Virginia 9672
Washington 3346
West Virginia 3596
Wisconsin 5210
Wyoming 1619
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question                                   dataset  x      y               color      facet              pch        dimensions
comparative longitudinal case trend (log)  long     time   log_cases       geography  none (case type?)  case_type  [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend        long     time   cases           geography  case_type          ?          [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend   wide     time   mortality rate  geography  none               none       [15, 50, 4] geography
death vs total correlation                 wide     cases  deaths          geography  none               none       [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in case/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in case/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical region, log case vs day (100th case)
    # output: m, b for each geographical region ID



# total/death on same plot - they differ by 2 logs, so when plotting log, use pch; when plotting absolute, need to use free scales
# when plotting death and case on the same plot, melt.

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 53949
China 140
Diamond Princess 121
Korea, South 111
Japan 110
Italy 108
Iran 105
Singapore 102
France 101
Germany 101
Spain 100
US 99
Switzerland 97
United Kingdom 97
Belgium 96
Netherlands 96
Norway 96
Sweden 96
Austria 94
Malaysia 93
Australia 92
Bahrain 92
Denmark 92
Canada 91
Qatar 91
Iceland 90
Brazil 89
Czechia 89
Finland 89
Greece 89
Iraq 89
Israel 89
Portugal 89
Slovenia 89
Egypt 88
Estonia 88
India 88
Ireland 88
Kuwait 88
Philippines 88
Poland 88
Romania 88
Saudi Arabia 88
Indonesia 87
Lebanon 87
Pakistan 87
San Marino 87
Thailand 87
Chile 86
Luxembourg 85
Peru 85
Russia 85
Ecuador 84
Mexico 84
Slovakia 84
South Africa 84
United Arab Emirates 84
Armenia 83
Colombia 83
Croatia 83
Panama 83
Serbia 83
Taiwan* 83
Turkey 83
Argentina 82
Bulgaria 82
Latvia 82
Uruguay 82
Algeria 81
Costa Rica 81
Dominican Republic 81
Hungary 81
Andorra 80
Bosnia and Herzegovina 80
Jordan 80
Lithuania 80
Morocco 80
New Zealand 80
North Macedonia 80
Vietnam 80
Albania 79
Cyprus 79
Malta 79
Moldova 79
Brunei 78
Burkina Faso 78
Sri Lanka 78
Tunisia 78
Ukraine 77
Azerbaijan 76
Ghana 76
Kazakhstan 76
Oman 76
Senegal 76
Venezuela 76
Afghanistan 75
Cote d’Ivoire 75
Cuba 74
Mauritius 74
Uzbekistan 74
Cambodia 73
Cameroon 73
Honduras 73
Nigeria 73
West Bank and Gaza 73
Belarus 72
Georgia 72
Bolivia 71
Kosovo 71
Kyrgyzstan 71
Montenegro 71
Congo (Kinshasa) 70
Kenya 69
Niger 68
Guinea 67
Rwanda 67
Trinidad and Tobago 67
Paraguay 66
Bangladesh 65
Djibouti 63
El Salvador 62
Guatemala 61
Madagascar 60
Mali 59
Congo (Brazzaville) 56
Jamaica 56
Gabon 54
Somalia 54
Tanzania 54
Ethiopia 53
Burma 52
Sudan 51
Liberia 50
Maldives 48
Equatorial Guinea 47
Cabo Verde 45
Sierra Leone 43
Guinea-Bissau 42
Togo 42
Zambia 41
Eswatini 40
Chad 39
Tajikistan 38
Haiti 36
Sao Tome and Principe 36
Benin 34
Nepal 34
Uganda 34
Central African Republic 33
South Sudan 33
Guyana 31
Mozambique 30
Yemen 26
Mongolia 25
Mauritania 22
Nicaragua 22
Malawi 16
Syria 16
Zimbabwe 14
Bahamas 13
Libya 13
Comoros 11
Suriname 3
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this - China has different provinces; the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 140
Korea, South 111
Japan 110
Italy 108
Iran 105
Singapore 102
France 101
Germany 101
Spain 100
US 99
Switzerland 97
United Kingdom 97
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-03-01        18322
## 2    Afghanistan           <NA> <NA> 2020-02-13        18305
## 3    Afghanistan           <NA> <NA> 2020-02-29        18321
## 4    Afghanistan           <NA> <NA> 2020-05-08        18390
## 5    Afghanistan           <NA> <NA> 2020-02-12        18304
## 6    Afghanistan           <NA> <NA> 2020-02-28        18320
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      0                     1     0.00000000
## 2                      0                     0            NaN
## 3                      0                     1     0.00000000
## 4                    109                  3778     0.02885124
## 5                      0                     0            NaN
## 6                      0                     1     0.00000000
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  0.000000                       -Inf        18348
## 2                      -Inf                       -Inf        18348
## 3                  0.000000                       -Inf        18348
## 4                  3.577262                   2.037426        18348
## 5                      -Inf                       -Inf        18348
## 6                  0.000000                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1            -26  NA   NA         NA                           NA
## 2            -43  NA   NA         NA                           NA
## 3            -27  NA   NA         NA                           NA
## 4             42  NA   NA         NA                           NA
## 5            -44  NA   NA         NA                           NA
## 6            -28  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

1. Find the time at which cases vs. date becomes linear in each plot
2. Fit a linear model for each state

# What is the predict # of cases for the next few days?
# How is the model performing historically?
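`find_linear_index` is defined upstream and not shown in this section. As a rough sketch of the idea (a hypothetical implementation, not the one used below), the onset of the linear regime can be taken as the first date where the discrete second derivative of cases vs. time is near zero:

```r
# Hypothetical sketch of a find_linear_index-style helper: return the
# numeric date at which a state's case curve becomes approximately
# linear, i.e. where the second difference of the totals is small
# relative to the running case count.
find_linear_index_sketch <- function(df, tol = 0.05) {
  df <- df[order(df$Date), ]
  y  <- df$Total_confirmed_cases_perstate
  d2 <- diff(y, differences = 2)         # discrete 2nd derivative
  rel <- abs(d2) / pmax(y[-c(1, 2)], 1)  # scale so tol is relative
  idx <- which(rel < tol)[1]
  if (is.na(idx)) idx <- 1               # fall back to the start
  as.numeric(df$Date[idx + 2])
}
```

A per-state result in this shape (one numeric date) matches how `li` is consumed later, where each state's data is filtered with `as.numeric(Date) >= date.i` before fitting.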

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually swap New York out for Maryland in the top-states list
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Plot top-state case loads with the fitted models overlaid

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Zero-row data frame so the rbind below doesn't leave a spurious NA row
predicted_days_df<-data.frame(Province.State=character(0),days=numeric(0),Total_confirmed_cases_perstate=numeric(0))

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")
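`prediction_model` is defined upstream and not shown here. Given the per-state fit above (raw counts regressed on numeric date), a minimal hypothetical version would return a data frame in the shape that the loop above rbinds:

```r
# Hypothetical sketch of prediction_model: given slope m and intercept b
# from lm(Total_confirmed_cases_perstate ~ Date), project case counts for
# a vector of numeric days (days since 1970-01-01).
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```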

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 cases based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be till x number of cases? When will the plateau happen? Are any effects of social distancing (and its delays) noticeable?
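As a quick worked example for the "how long till x cases" question: under a log-linear fit, log10(cases) = m * days + b, so the day count to reach a target X is (log10(X) - b) / m. The parameters below are made up for illustration, not fitted values:

```r
# Days (since the 100th case) to reach a target case count X under a
# log10-linear model: log10(X) = m * days + b  =>  days = (log10(X) - b) / m
days_to_reach <- function(X, m, b) (log10(X) - b) / m

days_to_reach(1e6, m = 0.1, b = 2)  # 40 days with these hypothetical parameters
```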

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
ddply(tmp,c("state"),nrow) # quick per-group row count
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_column,formula_input){
  # Fit one linear model per group, then collect per-group coefficients
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # coef() names the columns "(Intercept)" and the predictor; rename to b and m
  names(case_models.parameters)[2:3]<-c("b","m")
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# actual_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# actual_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(actual_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases (today) per 100 people")+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Grocery_Pharmacy 0.9988048 -34.0
Hawaii Retail_Recreation 0.9966539 -56.0
New Hampshire Parks 0.9539163 -20.0
Vermont Parks 0.9222119 -35.5
Maine Transit -0.9156101 -50.0
Utah Residential -0.8915005 12.0
Connecticut Grocery_Pharmacy -0.8873027 -6.0
South Dakota Parks 0.8291543 -26.0
Utah Transit -0.8208656 -18.0
Hawaii Parks 0.8123312 -72.0
Wyoming Parks -0.7865584 -4.0
Arizona Grocery_Pharmacy -0.7671026 -15.0
Rhode Island Workplace -0.7648509 -39.5
Alaska Workplace -0.7623320 -33.0
Hawaii Transit 0.7616438 -89.0
Massachusetts Workplace -0.7519018 -39.0
Connecticut Transit -0.7476110 -50.0
Alaska Grocery_Pharmacy -0.6637043 -7.0
Vermont Grocery_Pharmacy -0.6571589 -25.0
New York Workplace -0.6510191 -34.5
Hawaii Residential -0.6479960 19.0
Wyoming Transit -0.6407420 -17.0
Utah Parks -0.6395308 17.0
Alaska Residential 0.6299525 13.0
Rhode Island Retail_Recreation -0.6264114 -45.0
Maine Workplace -0.6050637 -30.0
North Dakota Parks 0.6043531 -34.0
New Jersey Parks -0.6032475 -6.0
Rhode Island Residential -0.6030669 18.5
Arizona Residential 0.5947354 13.0
Utah Workplace -0.5918574 -37.0
New York Retail_Recreation -0.5887457 -46.0
Nevada Transit -0.5848343 -20.0
Nebraska Workplace 0.5832840 -32.0
Arizona Retail_Recreation -0.5778725 -42.5
New Jersey Workplace -0.5460603 -44.0
New York Parks 0.5293175 20.0
Idaho Residential -0.5225136 11.0
Connecticut Residential 0.5122214 14.0
Connecticut Retail_Recreation -0.5106185 -45.0
Massachusetts Retail_Recreation -0.4979914 -44.0
Maine Parks 0.4913224 -31.0
Kansas Parks 0.4876324 72.0
New Jersey Grocery_Pharmacy -0.4857417 2.5
Montana Parks -0.4787407 -58.0
New Mexico Grocery_Pharmacy -0.4784741 -11.0
Montana Residential 0.4759730 14.0
West Virginia Parks 0.4758243 -33.0
Connecticut Workplace -0.4740261 -39.0
Nebraska Residential -0.4686972 14.0
Iowa Parks -0.4629253 28.5
Rhode Island Parks 0.4526207 52.0
North Dakota Retail_Recreation -0.4521241 -42.0
New Mexico Residential 0.4505197 13.5
New Jersey Retail_Recreation -0.4393207 -62.5
Maryland Workplace -0.4369250 -35.0
Illinois Transit -0.4333163 -31.0
Pennsylvania Workplace -0.4253312 -36.0
New Mexico Parks 0.4226177 -31.5
South Carolina Workplace 0.4214674 -30.0
New Jersey Transit -0.4088464 -50.5
Massachusetts Grocery_Pharmacy -0.4074297 -7.0
Wyoming Workplace -0.4072068 -31.0
Arizona Transit 0.4056412 -38.0
New Hampshire Residential -0.4040777 14.0
Maryland Grocery_Pharmacy -0.4017899 -10.0
Vermont Residential 0.3978518 11.5
Alabama Grocery_Pharmacy -0.3952578 -2.0
Pennsylvania Retail_Recreation -0.3841317 -45.0
Alabama Workplace -0.3765957 -29.0
Oregon Workplace -0.3731658 -31.0
New York Transit -0.3694462 -48.0
Kentucky Parks -0.3693475 28.5
Hawaii Workplace 0.3629922 -46.0
Michigan Parks 0.3617934 30.0
Idaho Grocery_Pharmacy -0.3545210 -5.5
Wisconsin Transit -0.3533451 -23.5
New Mexico Retail_Recreation -0.3529529 -42.5
Idaho Workplace -0.3492358 -29.0
Arkansas Parks 0.3438453 -12.0
Nebraska Grocery_Pharmacy 0.3424183 -0.5
California Transit -0.3404156 -42.0
Oregon Parks -0.3348131 16.5
California Residential 0.3253257 14.0
Wyoming Grocery_Pharmacy -0.3241197 -10.0
North Dakota Workplace 0.3181998 -40.0
Arkansas Retail_Recreation -0.3181385 -30.0
Virginia Transit -0.3177891 -33.0
Maryland Retail_Recreation -0.3150650 -39.0
Idaho Transit -0.3140108 -30.0
Maine Retail_Recreation -0.3038203 -42.0
California Parks -0.3030799 -38.5
Minnesota Transit -0.2960611 -28.5
Illinois Workplace -0.2909200 -30.5
Montana Transit 0.2900356 -41.0
Alaska Retail_Recreation 0.2883929 -39.0
Nevada Residential 0.2871329 17.0
North Carolina Grocery_Pharmacy 0.2871149 0.0
Mississippi Residential 0.2859399 13.0
Vermont Retail_Recreation 0.2858287 -57.0
Pennsylvania Parks 0.2848743 12.0
Colorado Residential 0.2836404 14.0
Florida Residential 0.2774242 14.0
Missouri Residential -0.2766350 13.0
Alabama Transit -0.2730372 -36.5
North Carolina Workplace 0.2720769 -31.0
Texas Workplace 0.2669533 -32.0
Rhode Island Grocery_Pharmacy 0.2634588 -7.5
Georgia Grocery_Pharmacy -0.2603165 -10.0
Maryland Residential 0.2580160 15.0
Texas Residential -0.2572952 15.0
Kansas Workplace 0.2556432 -32.5
Rhode Island Transit -0.2528446 -56.0
Pennsylvania Grocery_Pharmacy -0.2524454 -6.0
South Carolina Parks -0.2523647 -23.0
Illinois Parks 0.2492472 26.5
Tennessee Workplace -0.2482144 -31.0
Vermont Workplace -0.2431951 -43.0
West Virginia Grocery_Pharmacy -0.2402761 -6.0
Florida Parks -0.2333037 -43.0
Tennessee Residential 0.2311994 11.5
New York Grocery_Pharmacy -0.2302796 8.0
Wisconsin Parks 0.2193851 51.5
Nevada Retail_Recreation -0.2162014 -43.0
Arkansas Residential 0.2131342 12.0
South Dakota Workplace 0.2123763 -35.0
Georgia Workplace -0.2105623 -33.5
Montana Workplace -0.2095925 -40.0
North Carolina Transit 0.2095334 -32.0
North Dakota Grocery_Pharmacy -0.2067820 -8.0
Illinois Residential 0.2056484 14.0
Washington Grocery_Pharmacy 0.2050153 -7.0
Idaho Retail_Recreation -0.2043347 -39.5
Utah Retail_Recreation -0.2020378 -40.0
Michigan Workplace -0.2013326 -40.0
Georgia Retail_Recreation -0.2011285 -41.0
North Carolina Residential 0.2005892 13.0
Minnesota Parks 0.1996084 -9.0
West Virginia Workplace 0.1971850 -33.0
Kansas Grocery_Pharmacy -0.1953829 -14.0
Oregon Residential 0.1951433 10.5
Mississippi Grocery_Pharmacy -0.1950848 -8.0
Colorado Parks -0.1886578 2.0
Virginia Residential 0.1864286 14.0
Wisconsin Residential -0.1794217 14.0
Oregon Transit 0.1791476 -27.5
Connecticut Parks 0.1766471 43.0
Arizona Parks -0.1755355 -44.5
Texas Parks 0.1752972 -42.0
Kentucky Grocery_Pharmacy 0.1693979 4.0
Pennsylvania Transit -0.1691295 -42.0
New Jersey Residential 0.1672977 18.0
Kentucky Transit -0.1664271 -31.0
Massachusetts Parks 0.1627647 39.0
Nebraska Transit -0.1623863 -9.0
Missouri Workplace 0.1623314 -28.0
New Mexico Transit 0.1618125 -38.5
Indiana Residential 0.1611995 12.0
New Hampshire Retail_Recreation -0.1592690 -41.0
Nevada Workplace -0.1589099 -40.0
Ohio Transit 0.1565556 -28.0
Iowa Transit 0.1547735 -24.0
Indiana Parks -0.1542673 29.0
California Grocery_Pharmacy -0.1512850 -11.5
South Dakota Residential 0.1467175 15.0
Indiana Retail_Recreation 0.1464244 -38.0
Virginia Grocery_Pharmacy -0.1437632 -8.0
Alabama Parks 0.1432982 -1.0
Florida Retail_Recreation 0.1420857 -43.0
California Retail_Recreation -0.1354536 -44.0
Mississippi Retail_Recreation -0.1335219 -40.0
Michigan Retail_Recreation -0.1309573 -53.0
Minnesota Retail_Recreation 0.1288315 -40.0
Wisconsin Workplace -0.1282461 -31.0
Georgia Residential -0.1278194 13.0
North Dakota Transit 0.1249145 -48.0
Washington Workplace -0.1246362 -38.0
Missouri Transit -0.1239481 -24.5
Mississippi Transit -0.1221416 -38.5
Oregon Grocery_Pharmacy 0.1212899 -7.0
Texas Grocery_Pharmacy 0.1197658 -14.0
South Dakota Transit -0.1176008 -40.0
North Dakota Residential -0.1174540 17.0
Tennessee Parks -0.1171558 10.5
Wyoming Residential 0.1131114 12.5
Wisconsin Grocery_Pharmacy 0.1114205 -1.0
Kentucky Residential 0.1096770 12.0
Virginia Parks 0.1096504 6.0
Nebraska Retail_Recreation 0.1071303 -36.0
Massachusetts Transit -0.1070010 -45.0
Idaho Parks 0.1065571 -22.0
Arkansas Workplace -0.1041190 -26.0
Kansas Transit -0.1039958 -26.5
Missouri Parks 0.1032659 0.0
New Hampshire Grocery_Pharmacy -0.1023689 -6.0
Nevada Parks 0.1017500 -12.5
Alabama Retail_Recreation 0.1013152 -39.0
Maryland Transit -0.1008308 -39.0
Oklahoma Grocery_Pharmacy -0.1004824 -0.5
Massachusetts Residential 0.0989548 15.0
Michigan Transit 0.0968468 -46.0
Texas Transit 0.0962167 -41.0
Ohio Residential 0.0957931 14.0
North Carolina Retail_Recreation 0.0939504 -34.0
Ohio Parks -0.0937042 67.5
Washington Residential 0.0917827 13.0
New York Residential 0.0914003 17.5
Michigan Residential 0.0899615 15.0
North Carolina Parks -0.0888777 7.0
Iowa Workplace 0.0887354 -30.0
Minnesota Workplace -0.0874466 -33.0
Georgia Parks 0.0869309 -6.0
Indiana Workplace 0.0866900 -34.0
Virginia Workplace -0.0858091 -31.5
Mississippi Parks -0.0847705 -25.0
Oklahoma Workplace 0.0842762 -31.0
Pennsylvania Residential 0.0812088 15.0
California Workplace -0.0793035 -36.0
Virginia Retail_Recreation -0.0780586 -35.0
Utah Grocery_Pharmacy 0.0777671 -4.0
Oklahoma Parks 0.0759811 -18.5
Ohio Grocery_Pharmacy 0.0732673 0.0
Minnesota Grocery_Pharmacy 0.0730386 -6.0
Maine Residential -0.0724066 11.0
Minnesota Residential -0.0721573 17.0
Iowa Grocery_Pharmacy -0.0719906 4.0
Oregon Retail_Recreation -0.0716936 -40.5
Florida Grocery_Pharmacy 0.0716264 -14.0
Washington Transit -0.0705954 -33.5
Kentucky Retail_Recreation 0.0696881 -29.0
Colorado Transit 0.0690322 -36.0
Texas Retail_Recreation 0.0651583 -40.0
Michigan Grocery_Pharmacy -0.0636828 -11.0
Alabama Residential -0.0622725 11.0
Indiana Grocery_Pharmacy -0.0582904 -5.5
South Carolina Transit 0.0537854 -45.0
Washington Parks 0.0530342 -3.5
Ohio Retail_Recreation 0.0526585 -36.0
West Virginia Residential -0.0522755 11.0
Oklahoma Residential 0.0518057 15.0
West Virginia Transit -0.0509177 -45.0
Wyoming Retail_Recreation -0.0491461 -39.0
South Carolina Grocery_Pharmacy 0.0481651 1.0
New Hampshire Workplace 0.0478305 -37.0
Mississippi Workplace -0.0475717 -33.0
South Carolina Residential -0.0471154 12.0
Kentucky Workplace -0.0464000 -36.5
Indiana Transit 0.0459120 -29.0
Illinois Grocery_Pharmacy -0.0426129 2.0
Vermont Transit -0.0425239 -63.0
Ohio Workplace -0.0358474 -35.0
Florida Transit -0.0333435 -49.0
Maine Grocery_Pharmacy -0.0330937 -13.0
South Dakota Retail_Recreation -0.0325004 -39.0
Wisconsin Retail_Recreation 0.0305603 -44.0
Arkansas Transit 0.0287536 -27.0
Colorado Grocery_Pharmacy -0.0273857 -17.0
Florida Workplace -0.0270690 -33.0
Missouri Retail_Recreation -0.0251286 -36.5
Georgia Transit -0.0228392 -35.0
Arizona Workplace -0.0224694 -35.0
New Mexico Workplace 0.0217712 -34.0
Tennessee Grocery_Pharmacy 0.0217489 6.0
Illinois Retail_Recreation 0.0215125 -40.0
Iowa Retail_Recreation 0.0207721 -38.0
Montana Grocery_Pharmacy -0.0200463 -16.0
Colorado Retail_Recreation -0.0192782 -44.0
Iowa Residential -0.0192644 13.0
Colorado Workplace 0.0172558 -39.0
Arkansas Grocery_Pharmacy -0.0164697 3.0
Kansas Residential -0.0158527 13.0
West Virginia Retail_Recreation -0.0150510 -38.5
Tennessee Transit 0.0147399 -32.0
Oklahoma Transit 0.0145138 -26.0
Montana Retail_Recreation 0.0131230 -51.0
Oklahoma Retail_Recreation 0.0120544 -31.0
Tennessee Retail_Recreation -0.0116769 -30.0
Kansas Retail_Recreation -0.0091933 -38.0
Maryland Parks 0.0078458 27.0
Missouri Grocery_Pharmacy 0.0068602 2.0
Nebraska Parks 0.0062342 55.5
New Hampshire Transit -0.0053081 -57.0
Nevada Grocery_Pharmacy 0.0020969 -12.5
South Carolina Retail_Recreation -0.0020929 -35.0
South Dakota Grocery_Pharmacy -0.0010509 -9.0
Washington Retail_Recreation -0.0002791 -42.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
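The NA correlations at the bottom of the table (e.g. District of Columbia) are expected rather than a bug: `cor()` is undefined for a single observation pair and, with the default `use=` setting, propagates any missing value. A small demonstration:

```r
# cor() returns NA for a single pair (zero variance) and propagates NAs
# by default, which is likely why single-county states show NA above.
suppressWarnings(cor(1, 2))                          # single pair: NA
cor(c(1, 2, NA), c(3, 4, 5))                         # NA with default use=
cor(c(1, 2, NA), c(3, 4, 5), use = "complete.obs")   # drops the NA pair: 1
```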
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
|------|--------|------|
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |
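The manifest rows above can also be emitted directly as markdown links for the report; a minimal base-R sketch (the two-row `deliverable_manifest` here is an illustrative stand-in for the full data frame built above):

```r
# Stand-in for the full deliverable manifest built above (two rows only)
deliverable_manifest <- data.frame(
  name        = c("World total & death cases, longitudinal", "World mortality"),
  plot_handle = c("Corona_Cases.world.long.plot", "Corona_Cases.world.mortality.plot"),
  stringsAsFactors = FALSE
)

github_root <- "https://github.com/sbs87/coronavirus/blob/master/"
deliverable_manifest$link <- paste0(github_root, "results/",
                                    deliverable_manifest$plot_handle, ".png")

# One markdown link per deliverable
md_links <- sprintf("[%s](%s)", deliverable_manifest$name, deliverable_manifest$link)
cat(md_links, sep = "\n")
```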

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in a log-linear phase for the world or the U.S. (though some regions, such as MD, are still in the log-linear phase). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

| Question | Answer |
|----------|--------|
| What is the effect of social distancing and decreased mobility on caseload? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the better correlations, but the picture is mixed. |
| What is the trend in cases and mortality across geographical regions? | Total confirmed cases and mortality are overall log-linear for most countries, with a trailing-off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and caseload are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall. |
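The mortality-rate and log-linear claims above can be sketched in a few lines of base R. The daily totals below are invented for illustration only (they are not the analysis data), and the column names are hypothetical:

```r
# Invented daily totals for one region (illustration only)
cases <- data.frame(
  day             = 1:5,
  total_confirmed = c(1000, 1400, 1900, 2500, 3200),
  total_deaths    = c(20, 30, 45, 70, 100)
)

# Mortality rate as used above: cumulative deaths / cumulative confirmed cases
cases$mortality_rate <- cases$total_deaths / cases$total_confirmed

# A region is still in a log-linear phase if log(total cases) grows roughly
# linearly in time; a high R^2 on this fit is consistent with that
fit <- lm(log(total_confirmed) ~ day, data = cases)
summary(fit)$r.squared
```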

END

End: ##—— Wed Jun 10 22:01:21 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
install.packages("maps")
library(maps)
library(dplyr)  # for %>% and select()
# variable name kept from the tutorial example (Michigan); data here are Pennsylvania counties
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # placeholder case counts
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

# (variable names kept from the source tutorial; the region here is the U.S.)
some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinates for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

# sf data frames can be joined with dplyr verbs
dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf before using st_* functions

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net